Reference:
- Hadley Wickham, ggplot2: Elegant Graphics for Data Analysis
Most of the plots in this file are modifications of plots from this book.
Also, check out the ggplot2 cheat sheet.
gg stands for grammar of graphics, based on Leland Wilkinson’s idea that a plot should be structured according to certain rules, just as a sentence is structured according to grammar.
Every ggplot2 plot has three key components:
data
aesthetic mappings
layers
We use the trees dataset to explain the three components of the grammar of graphics:
g <- ggplot(data = trees, aes(x = Girth, y = Volume)) + geom_point() ## define plot g
g ## print the object (i.e. plot) g
data
You need to specify the data set (i.e. data frame, table, tibble, etc.) from which the plot takes its information. In the above example, we use the data set trees, which is built into base R.
aesthetic mappings
Each plot needs at least one set of aesthetic mappings between variables in the data and visual properties. The name “aesthetic mapping” may sound confusing; it just means that you tell ggplot which variables from the chosen data set will be used and how. Typically, you use the function aes() within ggplot() to specify the role of each variable used in the plot, as illustrated in the above example. Use the parameter
x for the variable on the \(x\)-axis (in the above example, it’s the variable Girth)
y for the variable on the \(y\)-axis (in the above example, it’s the variable Volume)
color (or colour) for the variable represented by the color of points/markers (if categorical, you get different colors; if numerical, you get different shades)
shape for the (categorical) variable represented by the shape of markers (different categories, different marker shapes)
size for the variable represented by the size of markers (larger value, larger marker size)
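As a minimal sketch of these parameters, the following maps two extra aesthetics for the trees data. Using Height for both color and size is an illustrative assumption, and the object name p_aes is hypothetical.

```r
library(ggplot2)

## Map Girth and Volume to the axes, and Height to both color and marker size.
## (Using Height twice is an illustrative choice, not taken from the text.)
p_aes <- ggplot(trees, aes(x = Girth, y = Volume, color = Height, size = Height)) +
  geom_point()

## The mappings are stored in the plot object; ggplot2 standardizes
## the spelling `color` to `colour` internally.
names(p_aes$mapping)

p_aes  ## print the plot
```

Because Height is numerical, the color legend should show a gradient of shades rather than distinct colors.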
layers
At least one layer is required, which describes how to render each observation. Layers are often created with a function whose name starts with geom_. For example:
geom_point() - for scatter plots
geom_line() - for curves (connected dots)
geom_smooth() - for fitted lines/curves.
geom_histogram() - for frequencies of numerical variables
geom_bar() - for frequencies of categorical variables
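The idea that a plot is built up layer by layer can be checked directly on the plot object. This is a small sketch; the object names base_plot, layered and hist_plot are hypothetical, and bins = 10 is an arbitrary choice.

```r
library(ggplot2)

## Data and aesthetic mappings only: this draws empty axes, no observations.
base_plot <- ggplot(trees, aes(x = Girth, y = Volume))

## Each geom_*() call adds one layer on top of the previous ones.
layered <- base_plot + geom_point() + geom_line()

length(base_plot$layers)  ## 0 layers
length(layered$layers)    ## 2 layers

## A histogram needs only an x aesthetic; bins = 10 is an arbitrary choice.
hist_plot <- ggplot(trees, aes(x = Height)) + geom_histogram(bins = 10)
hist_plot
```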
library(ggplot2) ## don't forget to load ggplot2 package
head(trees)
  Girth Height Volume
1   8.3     70   10.3
2   8.6     65   10.3
3   8.8     63   10.2
4  10.5     72   16.4
5  10.7     81   18.8
6  10.8     83   19.7
g <- ggplot(trees, aes(x = Girth, y = Volume)) + geom_point() ## define plot g
g ## print the plot g
g + geom_smooth(method="lm") + coord_cartesian(ylim=c(0,80)) ## add lin. reg. line to g
g + geom_smooth(method="lm", level=0.99) + coord_cartesian(ylim=c(0,80)) ## 99% conf. band instead of the default 95%
g + geom_smooth(method="lm", se=FALSE) + coord_cartesian(ylim=c(0,80)) ## don't include conf. interval
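To make the smoothing layers above more concrete, here is a sketch of the computation that geom_smooth(method="lm") roughly corresponds to, done directly with lm() and predict(). The girth values 10, 15 and 20 and the names fit, new_girth, band95 and band99 are assumptions chosen for illustration.

```r
## Ordinary least-squares fit of Volume on Girth, the model behind the line.
fit <- lm(Volume ~ Girth, data = trees)
coef(fit)  ## intercept and slope of the fitted line

## Confidence bands for the mean response at a few (arbitrary) girth values.
new_girth <- data.frame(Girth = c(10, 15, 20))
band95 <- predict(fit, new_girth, interval = "confidence", level = 0.95)
band99 <- predict(fit, new_girth, interval = "confidence", level = 0.99)

## Width of each band; the 99% band is wider than the 95% band.
width95 <- band95[, "upr"] - band95[, "lwr"]
width99 <- band99[, "upr"] - band99[, "lwr"]
cbind(new_girth, width95, width99)
```

This is why the shaded region grows when level=0.99 is used, and disappears entirely with se=FALSE.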
The default gray background can be removed by adding "+ theme_bw()".
g + geom_smooth(method="lm", se=FALSE) + theme_bw()
Another example, using the data set diamonds from the ggplot2 package.
head(diamonds) ## dataset from ggplot2, NOT diamond from UsingR
# A tibble: 6 x 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
ggplot(diamonds, aes(x=carat, y=price)) + geom_point(aes(color=clarity))
head(airquality,3)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
library(dplyr) ## needed for recode_factor()
month = recode_factor(airquality$Month, '5'="May", '6'="June", '7'="July",
                      '8'="August", '9'="September")
air = airquality[, c("Ozone","Solar.R","Wind")]
air$Month = month
air = air[!is.na(air$Ozone) & !is.na(air$Wind),]
head(air)
  Ozone Solar.R Wind Month
1    41     190  7.4   May
2    36     118  8.0   May
3    12     149 12.6   May
4    18     313 11.5   May
6    28      NA 14.9   May
7    23     299  8.6   May
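The same preparation can also be sketched in base R, without recode_factor(). The name air2 is hypothetical; month.name is the built-in vector of English month names.

```r
## Base-R sketch of the same data preparation (no extra packages needed).
air2 <- airquality[, c("Ozone", "Solar.R", "Wind")]

## Months 5 to 9 become the factor levels May, ..., September, in calendar order.
air2$Month <- factor(month.name[airquality$Month], levels = month.name[5:9])

## Keep only the rows where both Ozone and Wind are observed.
air2 <- air2[complete.cases(air2[, c("Ozone", "Wind")]), ]

levels(air2$Month)
nrow(air2)
```

Setting the levels explicitly keeps the months in calendar order in legends, instead of alphabetical order.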
ggplot(air, aes(x = Wind, y = Ozone)) + geom_point()
Apart from Wind and Ozone represented on \(x\) and \(y\) axes, we can represent the categorical variable Month by color.
ggplot(data=air, aes(x=Wind,y=Ozone,color=Month)) +
  geom_smooth(method="lm", se=FALSE, color="blue") +
  geom_point(alpha=0.5) + coord_cartesian(ylim=c(0,175))
g <- ggplot(air, aes(x = Wind, y = Ozone, color=Month)) +
geom_point(alpha=1/2) +
stat_smooth(method="lm", se=FALSE, fill=NA,
formula=y ~ poly(x, 2),color="blue") +
stat_smooth(method="loess", se=FALSE, color="red")
g
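To get a feeling for what the quadratic formula y ~ poly(x, 2) adds over a straight line, the two models can be compared outside ggplot2. This is only a rough sketch: it pools all months into a single fit, whereas the plot may fit one curve per month because color is mapped to Month. The names aq, fit_lin and fit_quad are hypothetical.

```r
## Rows of airquality with an observed Ozone value (Wind has no missing values).
aq <- airquality[!is.na(airquality$Ozone), ]

## Straight line versus second-degree polynomial in Wind.
fit_lin  <- lm(Ozone ~ Wind, data = aq)
fit_quad <- lm(Ozone ~ poly(Wind, 2), data = aq)

## Share of variance explained by each model.
r2 <- c(linear    = summary(fit_lin)$r.squared,
        quadratic = summary(fit_quad)$r.squared)
r2

## F-test for whether the quadratic term improves the fit.
anova(fit_lin, fit_quad)
```

The quadratic model can never explain less variance than the linear one, since the linear model is a special case of it; the anova() table indicates whether the gain is more than noise.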
ggplot(air, aes(x=Wind, y=Ozone, color=Month, size=Solar.R)) + geom_point(alpha=1/2) +
  geom_smooth(method="lm", se=FALSE, formula=y~poly(x,2), color="blue")
Warning: Removed 5 rows containing missing values (geom_point).
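The warning comes from the size aesthetic: points whose Solar.R value is missing cannot be given a size, so they are dropped. A quick check, rebuilding the filtered data from airquality under the hypothetical name air_chk:

```r
## Same row filter as used for `air`: Ozone and Wind must be observed.
air_chk <- airquality[!is.na(airquality$Ozone) & !is.na(airquality$Wind), ]

## Number of remaining rows with a missing Solar.R value.
n_missing <- sum(is.na(air_chk$Solar.R))
n_missing  ## 5, the number of rows reported in the warning
```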
Next, let’s look at the mpg data set from the ggplot2 package, which contains data on car mileage (miles per gallon).
head(mpg)
# A tibble: 6 x 11
  manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class
  <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>
1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa~
2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa~
3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa~
4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa~
5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa~
6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa~
ggplot(mpg, aes(displ, hwy)) + geom_point()
Apart from displacement and highway mileage being represented on \(x\) and \(y\) axes, we can represent the categorical variable class by color.
ggplot(mpg, aes(x = displ, y = hwy, color=class)) + geom_point(alpha=1/2) ## + geom_smooth(method="lm")
g <- ggplot(mpg, aes(x = displ, y = hwy, color=class)) +
geom_point(alpha=1/2) +
stat_smooth(method="lm", se=FALSE, fill=NA,
formula=y ~ poly(x, 3),color="blue") +
stat_smooth(method="loess", se=FALSE, color="red")
g
ggplot2 graph
The plotly package has a function ggplotly() that converts a ggplot2 graph into a plotly graph.
library(plotly)
ggplotly(g) %>% ## g is the graph from the previous slide
  layout(margin=list(l=200, t=60))
Apart from color representing the class variable, we can also use shape to represent the drv variable (4-, front-, rear-wheel drive).
ggplot(mpg, aes(displ, hwy)) + geom_point(aes(color = class, shape = drv))
Lastly, we can use the size of the markers to represent yet another variable, say, cty (city mileage). This way we use a 2D plot to represent 5 variables. We also change the legend titles of the drv and cty variables to drive train and city mileage.
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = class, shape = drv, size=cty)) +
scale_shape_discrete("drive train") + ## name in the legend for drv variable
scale_size_continuous("city mileage") ## name in the legend for cty variable
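An alternative sketch for renaming legends is labs(), which sets axis and legend titles in one call. The axis titles below are assumptions added for illustration, and the name p_labs is hypothetical.

```r
library(ggplot2)

## Same five-variable plot, with legend and axis titles set through labs().
p_labs <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = class, shape = drv, size = cty)) +
  labs(x = "engine displacement (litres)",  ## illustrative axis title
       y = "highway mileage (mpg)",         ## illustrative axis title
       shape = "drive train",               ## legend title for drv
       size = "city mileage")               ## legend title for cty
p_labs
```

The scale_*() functions remain the better choice when you also need to change breaks, limits or other scale settings, while labs() is convenient when only the titles change.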
Here is a simple example of a bar plot, using the mpg dataset.
head(mpg)
# A tibble: 6 x 11
  manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class
  <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>
1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa~
2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa~
3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa~
4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa~
5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa~
6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa~
# For vertical barplot, don't map a variable to y
ggplot(data=mpg, aes(x=class)) +
  geom_bar(stat="count", width=0.7, fill="steelblue") +
  theme_minimal()
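With stat="count", ggplot2 tallies the rows in each class for you. A sketch of the same bar heights computed by hand with table(), then drawn with geom_col(), which plots the supplied values as they are. The name class_counts is hypothetical.

```r
library(ggplot2)

## Count the cars in each class; as.data.frame() yields columns `class` and `Freq`.
class_counts <- as.data.frame(table(class = mpg$class))
class_counts

## geom_col() uses the y values directly instead of counting rows.
ggplot(class_counts, aes(x = class, y = Freq)) +
  geom_col(width = 0.7, fill = "steelblue") +
  theme_minimal()
```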
The following bar plots are variations on the theme from https://www.learnbyexample.org/r-bar-plot-ggplot2/
survey <- data.frame(fruit=c("Apple", "Banana", "Grapes", "Kiwi", "Orange", "Pears"),
people=c(40, 50, 30, 15, 35, 20))
survey
   fruit people
1  Apple     40
2 Banana     50
3 Grapes     30
4   Kiwi     15
5 Orange     35
6  Pears     20
ggplot(survey, aes(x=fruit, y=people)) +
  geom_bar(stat="identity", fill="red") ## a fixed color goes outside aes()
g <- ggplot(survey, aes(x=fruit, y=people, fill=fruit)) +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(face = "bold", color = "#993333",
size = 12, angle = 60, hjust=0.8))
g
library(plotly)
ggplotly(g) %>%
layout(margin=list(l=150, t=60)) %>%
config(displaylogo = FALSE)
In the following figures we plot the Tesla stock price (ticker/symbol: TSLA) at the close of each day from a year ago until the moment this file was last rendered, which is October 6, 2020. We also plot the daily volume, i.e. the number of shares that changed hands on each day in the last 365 days. In the first figure, the two are in the same plot, while in the second they are in two separate, vertically aligned subplots of the same figure. The code is flexible: if you change the ticker TSLA to the ticker of some other company and render this file, the Tesla graphs are replaced by the graphs of the corresponding company.
library(ggplot2)
library(ggpubr) ## used for function ggarrange(), for ggplot subplots
library(quantmod) ## used for getting stock data from Yahoo Finance
library(timetk) ## needed for function tk_tbl(), to convert time series to tibble
ticker = "TSLA" ## change this to any other ticker
present = Sys.time() ## getting the current time, right now (when this file is rendering)
fromtime = present - 365*24*60*60 ## a year ago
## use getSymbols() from package quantmod to get time series (`xts` object) w/ stock data
tick_df = getSymbols(ticker, from=fromtime, to = present,
src = "yahoo", auto.assign=FALSE)
'getSymbols' currently uses auto.assign=TRUE by default, but will
use auto.assign=FALSE in 0.5-0. You will still be able to use
'loadSymbols' to automatically load data. getOption("getSymbols.env")
and getOption("getSymbols.auto.assign") will still be checked for
alternate defaults.
This message is shown once per session and may be disabled by setting
options("getSymbols.warning4.0"=FALSE). See ?getSymbols for details.
Warning: 'indexClass<-' is deprecated.
Use 'tclass<-' instead.
See help("Deprecated") and help("xts-deprecated").
## convert the xts time series to a tibble, keeping the dates in a column named date
myts = tk_tbl(data=tick_df, rename_index = "date",
              preserve_index = TRUE)
## let's see the structure of the object myts (my time series)
str(myts)
Classes 'tbl_df', 'tbl' and 'data.frame': 251 obs. of 7 variables:
 $ date         : Date, format: "2019-08-19" "2019-08-20" ...
 $ TSLA.Open    : num 224 228 222 223 220 ...
 $ TSLA.High    : num 228 229 223 225 221 ...
 $ TSLA.Low     : num 222 225 218 218 211 ...
 $ TSLA.Close   : num 227 226 221 222 211 ...
 $ TSLA.Volume  : num 5309600 4125200 7794300 6559000 8538600 ...
 $ TSLA.Adjusted: num 227 226 221 222 211 ...
pricedf = getQuote(ticker) ## getQuote() is from quantmod package
## let's see what pricedf is
str(pricedf)
'data.frame': 1 obs. of 8 variables:
 $ Trade Time: POSIXct, format: "2020-08-14 16:00:01"
 $ Last      : num 1651
 $ Change    : num 29.7
 $ % Change  : num 1.83
 $ Open      : num 1665
 $ High      : num 1669
 $ Low       : num 1627
 $ Volume    : int 12373787
## get the current (i.e. the last) price;
## if the stock market is closed, it gives the price at closing
lasttime = format(pricedf$"Trade Time", tz="America/Phoenix",usetz=TRUE)
## text to be plotted on one of the graphs, with current stock price
mytext = paste(ticker," price: $",pricedf$Last, "\n", lasttime, sep="")
## using dplyr::pull() we create vectors of prices at closing
## as well as volumes (number of shares sold/bought, i.e. that changed hands)
closeprice = dplyr::pull(.data=myts, paste(ticker,".Close",sep=""))
volumes = dplyr::pull(.data=myts, paste(ticker,".Volume",sep=""))
if (closeprice[length(closeprice)]>closeprice[1]) {
mycolor = "seagreen"
} else {mycolor = "red2"}
g = ggplot(myts, aes(x=date, y=closeprice)) ## the line color is set in geom_line() below
k = ceiling(log10(max(volumes)/max(closeprice))) ##scaling factor; makes plot nicer
g <- g + geom_area(aes(y=volumes/10^k), alpha=0.7) + theme_bw() +
geom_line(color=mycolor) +
xlab("Time") +
ylab(paste("Close Price (in $)\nVolume of shares (in 10^",as.character(k),")",sep=""))
ggplotly(g)
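The scaling factor k used above can be illustrated with made-up numbers. The maxima below are hypothetical and are not actual TSLA data; they only show why the ceiling of the base-10 logarithm keeps the scaled volume within the range of the price.

```r
## Hypothetical maxima, chosen only for illustration (not real TSLA data).
max_volume <- 3.2e7   ## largest daily volume, in shares
max_price  <- 1650    ## largest closing price, in dollars

## Smallest power of ten that brings the volume down to the price scale.
k <- ceiling(log10(max_volume / max_price))
k                                  ## 5

## Since 10^k >= max_volume / max_price, the scaled volume cannot exceed max_price.
scaled_max <- max_volume / 10^k
scaled_max                         ## 320
```

Both series can therefore share one y-axis, with the volume read in units of 10^k shares.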
g1 = g + geom_area(color="blue", fill=mycolor, alpha=0.5) +
  ylab("Close Price") + theme_bw() +
  ## annotate() draws the label once; geom_text() would redraw it for every row
  annotate("text", x=myts$date[10], y=max(closeprice),
           label=mytext,
           size=3, lineheight=1,
           hjust="left", vjust="top")
g2 = ggplot(myts, aes(x=date, y=volumes)) +
  geom_area(color="blue", fill="royalblue3") + theme_bw() +
  xlab("Time") + ylab("Volume")
## use function ggarrange() from package ggpubr to make
## two ggplots below each other
ggpubr::ggarrange(g1,g2, nrow=2, align="v")
Dataset msleep from ggplot2 package
library(ggplot2)
head(msleep)
# A tibble: 6 x 11
  name  genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
  <chr> <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
1 Chee~ Acin~ carni Carn~ lc                  12.1      NA        NA      11.9
2 Owl ~ Aotus omni  Prim~ <NA>                17         1.8      NA       7
3 Moun~ Aplo~ herbi Rode~ nt                  14.4       2.4      NA       9.6
4 Grea~ Blar~ omni  Sori~ lc                  14.9       2.3       0.133   9.1
5 Cow   Bos   herbi Arti~ domesticated         4         0.7       0.667  20
6 Thre~ Brad~ herbi Pilo~ <NA>                14.4       2.2       0.767   9.6
# ... with 2 more variables: brainwt <dbl>, bodywt <dbl>
g <- ggplot(msleep, aes(brainwt, bodywt)) +
scale_x_log10() +
scale_y_log10()
g
g + geom_point(aes(color = vore)) +
scale_color_manual(
values = c("red", "orange", "green", "blue"),
na.value = "grey50"
)
Warning: Removed 27 rows containing missing values (geom_point).
clrs = c(carni = "red", insecti = "orange",
herbi = "green", omni = "blue")
g + geom_point(aes(color = vore)) +
scale_color_manual(values = clrs)
Warning: Removed 32 rows containing missing values (geom_point).
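The two scale_color_manual() calls above differ in how colors are matched to the categories. As a sketch of the matching rule: unnamed values are assigned by position to the sorted categories, while a named vector is matched by name. The names vore_levels, unnamed and by_position are hypothetical.

```r
library(ggplot2)

## Categories of vore in sorted order, which is the order the scale uses.
vore_levels <- sort(unique(na.omit(msleep$vore)))
vore_levels

## Unnamed values are assigned by position, so herbi gets "orange"
## and insecti gets "green".
unnamed <- c("red", "orange", "green", "blue")
by_position <- setNames(unnamed, vore_levels)
by_position

## A named vector is matched by name, so insecti gets "orange"
## and herbi gets "green", regardless of the order of the elements.
clrs <- c(carni = "red", insecti = "orange", herbi = "green", omni = "blue")
clrs[vore_levels]

## Rows lacking brainwt or bodywt cannot be placed on the log-log axes.
n_incomplete <- sum(!complete.cases(msleep[, c("brainwt", "bodywt")]))
n_incomplete
```

A named vector is the safer choice when the set of categories might change, because the color of each category no longer depends on its position.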
p <- ggplot(mpg, aes(cty, hwy)) + geom_jitter(width = 0.1, height = 0.1)
p + facet_wrap(~cyl)
mpg2 <- subset(mpg, cyl != 5 & drv %in% c("4", "f") & class != "2seater")
mpg2
# A tibble: 205 x 11
   manufacturer model    displ  year   cyl trans   drv     cty   hwy fl    class
   <chr>        <chr>    <dbl> <int> <int> <chr>   <chr> <int> <int> <chr> <chr>
 1 audi         a4         1.8  1999     4 auto(l~ f        18    29 p     comp~
 2 audi         a4         1.8  1999     4 manual~ f        21    29 p     comp~
 3 audi         a4         2    2008     4 manual~ f        20    31 p     comp~
 4 audi         a4         2    2008     4 auto(a~ f        21    30 p     comp~
 5 audi         a4         2.8  1999     6 auto(l~ f        16    26 p     comp~
 6 audi         a4         2.8  1999     6 manual~ f        18    26 p     comp~
 7 audi         a4         3.1  2008     6 auto(a~ f        18    27 p     comp~
 8 audi         a4 quat~   1.8  1999     4 manual~ 4        18    26 p     comp~
 9 audi         a4 quat~   1.8  1999     4 auto(l~ 4        16    25 p     comp~
10 audi         a4 quat~   2    2008     4 manual~ 4        20    28 p     comp~
# ... with 195 more rows
Facetting is an alternative to using aesthetics (like color, shape or size) to differentiate groups. Both techniques have strengths and weaknesses, based around the relative positions of the subsets. With facetting, each group is quite far apart in its own panel, and there is no overlap between the groups. This is good if the groups overlap a lot, but it does make small differences harder to see. When using aesthetics to differentiate groups, the groups are close together and may overlap, but small differences are easier to see.
df <- data.frame(
  x = rnorm(180, c(0, 2, 4)),
  y = rnorm(180, c(1, 2, 1)),
  z = letters[1:3] ## from the English alphabet (`base` package)
)
head(df,8)
            x          y z
1  0.25803542  0.7308333 a
2  1.72924639  3.2895742 b
3  5.47210456 -0.1804129 c
4  0.48750063  1.6734242 a
5  2.50323855  1.8514075 b
6  2.98597841  1.2579372 c
7 -0.03449611  1.4319008 a
8  0.35885873  2.5653356 b
ggplot(df, aes(x, y)) + geom_point(aes(color = z), size=3, alpha=0.5)
ggplot(df, aes(x, y)) + geom_point() + facet_wrap(~z)
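The simulated data rely on vector recycling: the mean vectors and letters[1:3] are repeated along the 180 rows, so every third row belongs to the same group and shares the same means. A reproducible sketch, where the seed and the names df_demo and grp_means are assumptions:

```r
set.seed(1)  ## arbitrary seed, only to make this sketch reproducible

## Same construction as above: means and group labels are recycled row by row.
df_demo <- data.frame(
  x = rnorm(180, c(0, 2, 4)),
  y = rnorm(180, c(1, 2, 1)),
  z = letters[1:3]
)

## Each group receives exactly 60 of the 180 rows.
table(df_demo$z)

## Group means land near (0, 1), (2, 2) and (4, 1) for a, b and c.
grp_means <- aggregate(cbind(x, y) ~ z, data = df_demo, FUN = mean)
grp_means
```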
Group comparison by showing all the group means in each panel:
df_sum <- df %>%
group_by(z) %>%
summarize(x = mean(x), y = mean(y)) %>%
rename(z2 = z)
ggplot(df, aes(x, y)) + geom_point() +
geom_point(data = df_sum, aes(color = z2), size = 4) +
facet_wrap(~z)
Group comparison by showing all the data in the background of each panel:
df2 <- dplyr::select(df, -z)
ggplot(df, aes(x, y)) +
  geom_point(data = df2, color = "grey70", size=3, alpha=0.4) +
  geom_point(aes(color = z), size=3) +
  facet_wrap(~z)
Here are some maps
## data set USArrests from base R
head(USArrests)
           Murder Assault UrbanPop Rape
Alabama      13.2     236       58 21.2
Alaska       10.0     263       48 44.5
Arizona       8.1     294       80 31.0
Arkansas      8.8     190       50 19.5
California    9.0     276       91 40.6
Colorado      7.9     204       78 38.7
The data represent arrest rates per 100,000 people. For example, for Arizona, murder=8.1 means 8.1 arrests for murder per 100,000 people.
df <- data.frame(murder = USArrests$Murder, state = tolower(rownames(USArrests)))
map <- ggplot2::map_data("state")
m <- ggplot(data=df, aes(fill = murder)) +
geom_map(aes(map_id = state), map = map) +
expand_limits(x = map$long, y = map$lat)
ggplotly(m) %>% ## from plotly package
layout(xaxis=list(title=""), yaxis=list(title="")) ## from plotly package
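geom_map() links the data to the map through map_id, so the state names in the data must match the region names of the map exactly. A quick sketch of that check, assuming the maps package is installed (map_data() depends on it); the names arrest_states and map_regions are hypothetical.

```r
library(ggplot2)  ## map_data() also needs the maps package to be installed

## Lower-case state names from the data, and region names from the map.
arrest_states <- tolower(rownames(USArrests))
map_regions   <- unique(map_data("state")$region)

## States in the data that have no polygon in the map.
not_drawn <- setdiff(arrest_states, map_regions)
not_drawn

## Map regions that have no row in the data.
no_data <- setdiff(map_regions, arrest_states)
no_data
```

The "state" map covers only the contiguous United States, so Alaska and Hawaii should be listed as not drawn, and any region without data is left unfilled.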
az_counties <- map_data("county", "arizona") %>%
select(lon = long, lat, group, id = subregion)
head(az_counties, 10)
         lon      lat group     id
1  -109.0453 35.99894     1 apache
2  -109.0511 34.95043     1 apache
3  -109.0511 34.95043     1 apache
4  -109.0511 34.57227     1 apache
5  -109.0568 33.77586     1 apache
6  -109.1656 33.77586     1 apache
7  -109.2917 33.77586     1 apache
8  -109.3261 33.78159     1 apache
9  -109.3490 33.77013     1 apache
10 -109.3605 33.75294     1 apache
ggplot(az_counties, aes(lon, lat, group = group)) +
geom_polygon(fill = "white", color = "blue") +
coord_quickmap() + theme_classic() +
theme(axis.ticks=element_blank(), ## don't show ticks
axis.text=element_blank(), ## don't show tick labels/text
axis.line=element_blank(), ## don't show line
axis.title=element_blank()) ## don't show axis name/title
Here we combine plot_usmap() from the usmap package with scale_fill_continuous() and theme() from the ggplot2 package:
library(usmap)
plot_usmap(data = statepop, values = "pop_2015", color = "white") +
  scale_fill_continuous(name = "Population (2015)", labels = scales::comma) +
theme(legend.position = "right")
m <- plot_usmap(data = statepop, values = "pop_2015", color = "white") +
scale_fill_gradientn(name = "Population (2015)", colors = rev(rainbow(7))) +
theme(legend.position = "right")
ggplotly(m)
usmap::plot_usmap(regions="counties", include = "Arizona")